Problem definition The lockdown and quarantine measures issued worldwide in response to the COVID-19 global pandemic have brought a series of impacts on daily life. Unfortunately, most of the effects of COVID-19 and the resulting lockdown have been shockingly negative - rising deaths, unemployment and the imminent global financial crisis are the most frequently reported problems around the world.

While COVID-19 has brought about many negative effects, it has also brought about some positive changes. For example, crime rates have decreased. From 1.10% (3.1 million) in 2019 to 0.93% (2.6 million) in 2020, the percentage of people aged 12 and older who are victims of violent crime has decreased by 15%. In addition, property crime declines from 7.37% (9.8 million) of households in 2016 to 6.19% (7.9 million) of households in 2020 (Morgan & Thompson, 2022).

This has aroused our interest in crime research. In addition, from February to April 2022, there were several armed robberies in Georgetown community. Therefore, we want to analyze the crimes in the District of Columbia where we study and live in the past five years from 2017 to 2021.

Data extraction For this project, we use crime data for the city of Washington DC, which are available from 2017 onwards on the city’s open data portal. Crime data for city of Washington DC available from their open data portal at: https://opendata.dc.gov. To make analysis manageable, we utilized the data from 2017-2021. We downloaded the DC crime data for 5 years, put them into a folder called “data”.

Because we focus on property crime, after reading the data, we cleaned up the data. We split the time, month and year of the crime according to the original data. It not only lays the foundation for the following visualization, but also facilitates our prediction models.

###data wrangling and pre-processing

#change case of the variables
change_case<-function(x){
  return(gsub("\\b([A-Z])([A-Z]+)", "\\U\\1\\L\\2", x, perl=TRUE))
}
#change case and split date variable to weekdays, month and year, label property offense 
crimes<-crimes%>%
  clean_names()%>%
  dplyr::mutate(shift=change_case(shift),
                method=change_case(method),
                offense=change_case(offense),
                block=change_case(block),
                bid=change_case(bid),
                weekday=lubridate::wday((report_dat), label=TRUE),
                month=lubridate::month((report_dat),label=TRUE),
                year=lubridate::year(report_dat),
                day=lubridate::day(report_dat),
                rep_date=date(report_dat),
                rep_time=hms::as_hms(ymd_hms(report_dat)),
                hours=lubridate::hour(rep_time),
                S_weekday=lubridate::wday((start_date), label=TRUE),
                S_month=lubridate::month((start_date),label=TRUE),
                S_year=lubridate::year(start_date),
                occur_date=date(start_date),
                occur_time=hms::as_hms(ymd_hms(start_date)),
                hour=lubridate::hour(occur_time),
                autotheft=ifelse(offense=="Theft F/Auto", 1, 0))%>%
  mutate(autotheft=as.factor(autotheft))%>%
  as_tibble()

#identify duplication incidents by CCN
crimes%>%
  group_by(ccn)%>%
  dplyr::mutate(count=n())%>%
  filter(count>1)
## # A tibble: 103 × 40
## # Groups:   ccn [50]
##        x     y     ccn report_dat shift method offense block xblock yblock  ward
##    <dbl> <dbl>   <dbl> <chr>      <chr> <chr>  <chr>   <chr>  <dbl>  <dbl> <dbl>
##  1 -77.0  38.9  1.70e7 2017/01/2… Even… Others Theft … 2900… 4.03e5 1.32e5     8
##  2 -77.0  38.9  1.70e7 2017/01/2… Even… Others Theft … 2900… 4.03e5 1.32e5     8
##  3 -77.0  38.9  1.71e7 2017/05/0… Even… Others Theft/… 1300… 3.98e5 1.38e5     2
##  4 -77.0  38.8  1.71e7 2017/05/0… Even… Others Theft/… 1100… 4.01e5 1.30e5     8
##  5 -77.0  38.9  1.71e7 2017/07/3… Midn… Gun    Homici… 4820… 3.98e5 1.42e5     4
##  6 -77.0  38.9  1.71e7 2017/07/3… Midn… Gun    Homici… 4820… 3.98e5 1.42e5     4
##  7 -77.0  38.9  1.70e7 2017/01/1… Midn… Gun    Homici… 600 … 4.01e5 1.37e5     6
##  8 -76.9  38.9  1.80e7 2018/01/0… Midn… Others Sex Ab… 4000… 4.05e5 1.37e5     7
##  9 -76.9  38.9  1.80e7 2018/01/0… Midn… Others Sex Ab… 4000… 4.05e5 1.37e5     7
## 10 -77.0  38.8  1.81e7 2018/05/0… Midn… Gun    Homici… 100 … 4.00e5 1.30e5     8
## # … with 93 more rows, and 29 more variables: anc <chr>, district <dbl>,
## #   psa <dbl>, neighborhood_cluster <chr>, block_group <chr>,
## #   census_tract <chr>, voting_precinct <chr>, latitude <chr>, longitude <chr>,
## #   bid <chr>, start_date <chr>, end_date <chr>, objectid <dbl>,
## #   octo_record_id <lgl>, weekday <ord>, month <ord>, year <dbl>, day <int>,
## #   rep_date <date>, rep_time <time>, hours <int>, S_weekday <ord>,
## #   S_month <ord>, S_year <dbl>, occur_date <date>, occur_time <time>, …
#remove duplicate rows 
crimes_clean<-filter(distinct(crimes, ccn, .keep_all = TRUE))
#check 
crimes_clean%>%
  group_by(ccn)%>%
  dplyr::mutate(count=n())%>%
  filter(count>1)
## # A tibble: 0 × 40
## # Groups:   ccn [0]
## # … with 40 variables: x <dbl>, y <dbl>, ccn <dbl>, report_dat <chr>,
## #   shift <chr>, method <chr>, offense <chr>, block <chr>, xblock <dbl>,
## #   yblock <dbl>, ward <dbl>, anc <chr>, district <dbl>, psa <dbl>,
## #   neighborhood_cluster <chr>, block_group <chr>, census_tract <chr>,
## #   voting_precinct <chr>, latitude <chr>, longitude <chr>, bid <chr>,
## #   start_date <chr>, end_date <chr>, objectid <dbl>, octo_record_id <lgl>,
## #   weekday <ord>, month <ord>, year <dbl>, day <int>, rep_date <date>, …
#overview of crime data
crimes_clean%>%
  group_by(offense)%>%
  dplyr::summarize(count=n())%>%
  arrange(desc(count))
## # A tibble: 9 × 2
##   offense                    count
##   <chr>                      <int>
## 1 Theft/Other                66111
## 2 Theft F/Auto               49585
## 3 Motor Vehicle Theft        13732
## 4 Robbery                    10466
## 5 Assault W/Dangerous Weapon  8389
## 6 Burglary                    6836
## 7 Sex Abuse                   1120
## 8 Homicide                     830
## 9 Arson                         35
#calculate frequency by offense 
crimes_clean%>%
  group_by(offense)%>%
  dplyr::summarise(count=sum(ccn, na.rm = TRUE))%>%
  mutate(freq=100*count/sum(count,na.rm = TRUE))%>%
  arrange(desc(freq))%>%
  ungroup()
## # A tibble: 9 × 3
##   offense                            count    freq
##   <chr>                              <dbl>   <dbl>
## 1 Theft/Other                1253976966604 42.0   
## 2 Theft F/Auto                941366183844 31.5   
## 3 Motor Vehicle Theft         265404975638  8.88  
## 4 Robbery                     199703277813  6.68  
## 5 Assault W/Dangerous Weapon  159873855268  5.35  
## 6 Burglary                    129909452355  4.35  
## 7 Sex Abuse                    21048141098  0.704 
## 8 Homicide                     16166671226  0.541 
## 9 Arson                          673049084  0.0225
#sf theft to join
crimes_sf <- st_as_sf(
  crimes_clean, 
  coords = c("longitude", "latitude"), 
  crs = 4326
)

In the table below, we observe that THEFT comprises about 70% of total crimes in the historical dataset. It represents over observations from 2017 to 2021. This is a sufficiently rich dataset for predictive analysis.

Visulizing Crime and Property Crime in 2017-2021

Visualizing data is a powerful way to gain high-level insight about underlying patterns in the data. Visualization provides useful clues about where we need to investigate further.

#Crime over time from 2017-2021

crimes_clean%>%
  group_by(offense, S_year)%>%
  dplyr::summarise(count = n())%>%
ggplot(aes(x = S_year, y = count)) +
  geom_line(color = "#F2CA27", size = 0.1) +
  geom_smooth(color = "#1A1A1A") 
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 row(s) containing missing values (geom_path).

Firstly, we make a visual analysis of the number of crimes from 2017-2022 committed by DC. According to the figure, we can find that over the past five years, there has been an overall downward trend in the number of crimes committed. In addition, compared with 2019, the number of crimes in 2020 has decreased significantly, which has a certain relationship with the COVID-19.

#Property crimes occur more frequently compared to other crimes

crimes_clean%>%
  group_by(offense)%>%
  dplyr::summarize(count=n())%>%
  arrange(desc(count))
## # A tibble: 9 × 2
##   offense                    count
##   <chr>                      <int>
## 1 Theft/Other                66111
## 2 Theft F/Auto               49585
## 3 Motor Vehicle Theft        13732
## 4 Robbery                    10466
## 5 Assault W/Dangerous Weapon  8389
## 6 Burglary                    6836
## 7 Sex Abuse                   1120
## 8 Homicide                     830
## 9 Arson                         35
crimes_clean %>% 
  group_by(offense) %>%
  dplyr::summarise(count=n())%>%
  ggplot(aes(x = reorder(offense,count), y=count, fill=offense))+
  geom_bar(stat="identity", width = 0.5, show.legend = FALSE)+
  geom_text(aes(label=count), vjust=2.5, size=3, colour = "black")+
  scale_fill_viridis(option = "D", discrete = T, direction = 1, begin = 0.9, end = 0.3) +
  theme_bw()+
  labs(x ="Offense", y = "Number of crimes", title = "Crimes in Washington D.C.") + 
  scale_y_continuous() +
  coord_flip()

Secondly, we want to further study and observe that the number of property crimes in the types of crimes is the largest, of which the largest is theft / other, with a total of 66111 cases, the second most is theft / auto, the third most is motor vehicle theft and the fourth most is robbiery. Theft f/Auto is one of the most common types of theft is theft of valuables from your automobile.

#Property crimes have mostly occurred in October of each year for the last five years

crimes_clean%>%
  group_by(offense, month)%>%
  drop_na(hour,weekday) %>%
  dplyr::summarise(count = n())%>%
  ggplot(aes(x=month, y=count, color=offense))+
  geom_line(aes(group=offense))+
  geom_point()+
  labs(x = "Month of Crime", y = "Number of Crime", title = "Number of Crime in DC from 2017 – 2021, by Report Time of Crime") 

Next, we would like to continue our analysis in the time dimension. With this line graph of the number of crimes and the time line, we can see that the high incidence of both property crimes and other crimes occurs around October every year.

#Crime Over Time

crimes_clean %>%
  group_by(hour,weekday) %>%
  drop_na(hour,weekday) %>%
  dplyr::summarise(count = n()) %>%
  ggplot(aes(x =as.factor(hour), y = weekday, fill = count, stat = "identity"))+
  geom_tile()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.6), legend.title = element_blank(), legend.position="top", legend.direction="horizontal", legend.key.width=unit(2, "cm"), legend.key.height=unit(0.25, "cm"), legend.margin=unit(-0.5,"cm"), panel.margin=element_blank())+
   labs(x = "Hour of Crime", y = "Day of Week of Crime", title = "Number of Crime in DC from 2017 – 2021, by Report Time of Crime") +
  scale_fill_viridis(option = "D")
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
## Warning: `legend.margin` must be specified using `margin()`. For the old
## behavior use legend.spacing

In order to further analyze the time, we use the heat map to subdivide the crime time into 24 hours. By doing this, we can more accurately see that 19-23 o’clock on Friday is the high incidence period of crime. In addition, 22 o’clock on Tuesday is also the high incidence period of crime.

#Certain Types of Crime May Be More Time Dependent

crimes_clean <- crimes_clean %>%
  filter(offense != 'Assault W/Dangerous Weapon') %>%
  filter(offense != 'Sex Abuse') %>%
  filter(offense != 'Arson') %>%
  filter(offense != 'Homicide') %>%
  drop_na(hour,weekday)

time <- crimes_clean %>% 
  select(weekday, hour, offense) %>%
  group_by(weekday, hour, offense)%>%
  dplyr::summarise(count = n())

ggplot(time, aes(x =as.factor(hour), y = weekday, fill = count))+
  geom_tile()+
 theme(axis.text.x = element_text(angle = 90, vjust = 0.6, size = 4)) +
  labs(x = "Hour of Property Crime", y = "Day of Week of Property Crime", title = "Number of Property Crime in Washington DC from 2017 – 2021, by Category and Report Time of Property Crime") +
  scale_fill_viridis(option = "D") +
  facet_wrap(~ offense, nrow = 2.5)
## Warning: Coercing `nrow` to be an integer.

This graph is good but the gradients aren’t helpful because they are not normalized. We need to normalize the range on each facet.

#Certain types of crime may be more time dependent(normalized).

crimes_clean%>%
    filter(offense %in% crimes_clean$offense[1:157104]) %>%
group_by(offense,weekday,hour) %>%
    drop_na(hour,weekday) %>%
  dplyr::summarise(count = n()) %>%
  mutate(norm = count/sum(count))%>%
  
ggplot(aes(x = hour, y = weekday, fill = norm)) +
  geom_tile() +
  # fte_theme() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.6, size = 4)) +
  labs(x = "Hour of Crime", y = "Day of Week of Crime", title = "Number of Crime in Washington DC from 2017 – 2021, by Category and Report Time") +
  scale_fill_viridis(option = "D") +
  facet_wrap(~ offense, nrow =4.5 )
## Warning: Coercing `nrow` to be an integer.

By subdividing the categories of crimes, we find that theft is the most frequent way of crime, and its high incidence period is 19:00 to 23:00 from Monday to Friday.

Geospatial Visulization of Crime from 2017-2021

We have already looked at the temporal distribution of crimes but crimes can vary considerably with respect to geographies. Typically, within an area there will be pockets or zones which observe higher criminal activity compared to the others, which are referred to as “hot-stops”. Also, the number of police stations around “hot-stops” can be used for testing effective policing. In this part, we will focus more on locating spatial patterns of each crime incident in the DC area.

#load the census tracts of DC 
unzip(zipfile="data/tl_2017_11_tract.zip",
      exdir = "data")
#read shapefile into sf
dc<-st_read("data/tl_2017_11_tract.shp")%>%
  select(GEOID, geometry)%>%
  clean_names()%>%
  st_transform(crs=4326)
## Reading layer `tl_2017_11_tract' from data source 
##   `/Users/Selina/Desktop/final asgn/data/tl_2017_11_tract.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 179 features and 12 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -77.11976 ymin: 38.79164 xmax: -76.90939 ymax: 38.99584
## Geodetic CRS:  NAD83
crimes_sf<-st_as_sf(crimes_clean, coords = c("longitude", "latitude"),crs=4326)

#The distribution of “theft” incidence in DC 2017-2021

#count the property-related "theft" crimes and perform a spatial join
dc_merged_agg<-st_join(dc, crimes_sf, join=st_intersects)%>%
  filter(offense=="Theft F/Auto")%>%
  group_by(geoid)%>%
  dplyr::summarize(count = n())
#create choropleth of the property-related "theft" crimes
count1<-dc_merged_agg%>%
 
  ggplot()+
  geom_sf(aes(fill=count), color = "white", size = 0.1)+
  scale_fill_viridis(option = "D")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.6))

count1

#count the property-related "theft" crimes and perform a spatial join
dc_merged_agg<-st_join(dc, crimes_sf, join=st_intersects)%>%
  filter(offense=="Theft/Other")%>%
  group_by(geoid)%>%
  dplyr::summarize(count = n())
#create choropleth of the count of property-related "theft" crimes
count2<-dc_merged_agg%>%
 
  ggplot()+
  geom_sf(aes(fill=count), color = "white", size = 0.1)+
  scale_fill_viridis(option = "D")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.6))

count2

#The distribution of “Robbery” incidence in DC 2017-2021

#count the property-related "Robbery" crimes and perform a spatial join
dc_merged_agg<-st_join(dc, crimes_sf, join=st_intersects)%>%
  filter(offense=="Robbery")%>%
  group_by(geoid)%>%
  dplyr::summarize(count = n())
#create choropleth of the count of property-related "Robbery" 
count3<-dc_merged_agg%>%
  ggplot()+
  geom_sf(aes(fill=count), color = "white", size = 0.1)+
  scale_fill_viridis(option = "D")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.6))
count3

#The distribution of “Burglary” incidence in District of Colmnbia 2017-2021

#count the property-related "Burglary" crimes and perform a spatial join
dc_merged_agg<-st_join(dc, crimes_sf, join=st_intersects)%>%
  filter(offense=="Burglary")%>%
  group_by(geoid)%>%
  dplyr::summarize(count = n())

#create choropleth of the count of property-related "Burglary" 
count4<-dc_merged_agg%>%
  ggplot()+
  geom_sf(aes(fill=count), color = "white", size = 0.1)+
  scale_fill_viridis(option = "D")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.6))

count4

#Combining the Mapping of different types of “property crime incidence”

According to the definition of “property crime”, we use “filter” to select four different types of property crime: theft F/Auto, Theft/Other, Robbery and Burglary, with five years’(2017-2021) data.

library(ggpubr)
p=ggarrange(count1,count2,count3,count4,labels=c("Theft F/Auto 2017-2021","Theft/Other 2017-2021","Robbery 2017-2021","Burglary 2017-2021",ncol=4,
                                   common.legend=TRUE,legend="none"),hjust = -1,vjust =1.1,font.label =list(size = 9)) 
p

From the spatial patterns, we can see the geographic distribution of each crime incidence. The gradation of color represents the number of crime incidence that took place in a particular area. It is evident that theft(theft F/Auto & theft/Other) happened with higher frequency, which usually take place in the center of DC area. In comparison, there is only a small number of robbery and burglary in the past five years. However, the eastern part of Washington D.C. has a higher level of robbery incidence, and the geographic distribution (hot-stop) of burglary is more dispersive than other types of property crimes.

#The Spatial geographic distribution of Police Stations in District of Columbia

#Identify the Spatial geographic distribution of Police Stations/Universities and Colleges/Metro Lines in District of Columbia

#load the Universities and Colleges of DC 
Universities<-st_read("data/Universities_and_Colleges.shp") 
## Reading layer `Universities_and_Colleges' from data source 
##   `/Users/Selina/Desktop/final asgn/data/Universities_and_Colleges.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 32 features and 13 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -77.09489 ymin: 38.86646 xmax: -76.98459 ymax: 38.97737
## Geodetic CRS:  WGS 84
#load the metro lines of DC 
Metro_lines<-st_read("data/Metro_Lines.shp")
## Reading layer `Metro_Lines' from data source 
##   `/Users/Selina/Desktop/final asgn/data/Metro_Lines.shp' using driver `ESRI Shapefile'
## Simple feature collection with 6 features and 10 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -77.08577 ymin: 38.83827 xmax: -76.91328 ymax: 38.97979
## Geodetic CRS:  WGS 84
#load the police stations of DC 
police_stations<- st_read("data/Police_Stations.shp")
## Reading layer `Police_Stations' from data source 
##   `/Users/Selina/Desktop/final asgn/data/Police_Stations.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 15 features and 12 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -77.07484 ymin: 38.85336 xmax: -76.94289 ymax: 38.96313
## Geodetic CRS:  WGS 84
##load the DC_tracts
DC_tracks <- st_read("data/tl_2017_11_tract.shp")
## Reading layer `tl_2017_11_tract' from data source 
##   `/Users/Selina/Desktop/final asgn/data/tl_2017_11_tract.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 179 features and 12 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -77.11976 ymin: 38.79164 xmax: -76.90939 ymax: 38.99584
## Geodetic CRS:  NAD83
#count the number of crimes and perform a spatial join
dc_merged_agg<-st_join(dc, crimes_sf, join=st_intersects)%>%
  group_by(geoid)%>%
  dplyr::summarize(count = n())

count<-dc_merged_agg%>%
  ggplot()+
  geom_sf(aes(fill=count), color = "white", size = 0.1)+
  scale_fill_viridis(option = "D")+
  geom_sf(data =police_stations,color="red")+
  geom_sf(data =Universities,color="orange")+
  geom_sf(data =Metro_lines,color="white")+
ggtitle("The geographical distribution of crimes \n around particular spots in DC ") +theme_void()

count

The red points on the plot are the police stations. The plot of mapping indicates that most of the police stations are located in and around high-crime (hot-stops) areas, especially those which are located in the middle of the city.

The orange points on the plot are universities and colleges. The plot of mapping indicates that universities located in the north part of DC is much safer than those situated in the middle, since the number of crime incidence are smaller, less than 2000 Cumulative cases in the past five years in total.

The white Metro bus lines go across the most dangerous areas in the center area and outstretch to all directions where crime incidence are less.

#City limits

#DC limits
#load the census tracts of dc
#unzip(zipfile="data/Census_Tracts_in_2020.zip",
      #exdir = "data")
#read shapefile into sf
dc_sf<-st_read("data/Census_Tracts_in_2020.shp")%>%
  clean_names()%>%
  select(geoid, geometry,tract)%>%
  st_transform(crs=4326)
## Reading layer `Census_Tracts_in_2020' from data source 
##   `/Users/Selina/Desktop/final asgn/data/Census_Tracts_in_2020.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 206 features and 315 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -77.11976 ymin: 38.79165 xmax: -76.9094 ymax: 38.99585
## Geodetic CRS:  WGS 84

#Census Tracts variables poverty, pop dens and unemployment #Getting Census Tracts through API

readRenviron("~/.Renviron")
credential<- Sys.getenv("census_api_key")
#find variable code and select variables
v2020 <- load_variables(2020, "acs5", cache = TRUE)
#population B01003_001
#poverty B17017_001
#unemployment B23025_001
#household income B19013_001
#getting variables from acs
options(tigris_use_cache = TRUE)
dc_acs<-as_tibble(get_acs(
  geography = "tract",
  survey="acs5",
  variables = c("B01003_001","B17017_001","B17017_002", "B23025_001",
  "B23025_005", "B19013_001","B06009_001", "B06009_002"),
  state=11,
  county=001,
  geometry=TRUE,
  year=2020,
))
#data wrangling and preprocessing 
#pivot tibbles and rename variables to get demographic features of dc
dc_demo<-dc_acs%>%
  clean_names()%>%
  select(geoid, variable, estimate, geometry)%>%
  pivot_wider(
    names_from = "variable", 
    values_from = "estimate")%>%
  dplyr::rename(population=B01003_001,
                poverty_raw=B17017_001,
                poverty_below=B17017_002,
                employ=B23025_001,
                unemploy=B23025_005,
                income=B19013_001,
                educ=B06009_001,
                belowhigh=B06009_002)%>%
  #handling missing data with average values
  mutate(avg_poverty_rate=poverty_below/poverty_raw, #extract regional average poverty rate
         avg_poverty_rate=ifelse(poverty_raw == 0, 0, poverty_below/poverty_raw),#impute missing data
         avg_unemploy_rate=unemploy/employ,#extract regional average poverty rate
         avg_unemploy_rate=ifelse(employ==0, avg_unemploy_rate, unemploy/employ),#impute missing data
         income=ifelse(is.na(income) | income==0, 98654, income), #median household income for missing data
         population=ifelse(is.na(population), 0, population),
         below_high=belowhigh/educ, #regional below high school educ
         below_high=ifelse(is.na(educ), 0, belowhigh/educ))#impute missing 

#Join census tract data with crime data

data<-st_join(crimes_sf, dc_sf, join=st_intersects)
data<-left_join(data, dc_demo, by="geoid")%>%
  st_set_geometry(NULL)
sample<-data%>%
  select(shift, ward, psa, block_group, 
         weekday, month,year,hours,
         autotheft,population,income, 
         avg_poverty_rate,avg_unemploy_rate,below_high)%>%
  mutate(autotheft=as.factor(autotheft))
write_csv(sample,"sample.csv")

#Simple decision tree model (the only one that succeed)

set.seed(20220502)
split<-initial_split(sample, prop=0.75)
theft_train<-training(split)
theft_test<-testing(split)
theft_rec<-
  recipe(autotheft ~ ., data = theft_train) %>%
  themis::step_downsample(autotheft)%>%
  prep(training=theft_train, retain = TRUE)

sampled_mod <-
  decision_tree() %>%
  set_engine(engine = "rpart") %>%
  set_mode(mode = "classification")
sampled_wf <- workflow() %>%
  add_recipe(theft_rec) %>%
  add_model(sampled_mod) 
#fit the model
sampled_fit <- sampled_wf %>%
  fit(data = theft_train)
#plot a decision tree
rpart.plot::rpart.plot(x = sampled_fit$fit$fit$fit)
## Warning: Cannot retrieve the data used to build the model (model.frame: 找不到对象'..y').
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

predictions <- bind_cols(
  theft_test,
  predict(object = sampled_fit, new_data = theft_test),
  predict(object = sampled_fit, new_data = theft_test, type = "prob")
)
select(predictions, autotheft, starts_with(".pred"))
## # A tibble: 36,682 × 4
##    autotheft .pred_class .pred_0 .pred_1
##    <fct>     <fct>         <dbl>   <dbl>
##  1 0         1             0.361   0.639
##  2 1         0             0.624   0.376
##  3 0         0             0.624   0.376
##  4 0         0             0.624   0.376
##  5 0         0             0.624   0.376
##  6 0         1             0.361   0.639
##  7 0         1             0.361   0.639
##  8 0         0             0.624   0.376
##  9 0         0             0.624   0.376
## 10 1         0             0.624   0.376
## # … with 36,672 more rows
conf_mat(data = predictions,
         truth = autotheft,
         estimate = .pred_class)
##           Truth
## Prediction     0     1
##          0 15824  5093
##          1  8513  7252
yardstick::accuracy(data = predictions,
         truth = autotheft,
         estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.629
yardstick::recall(data = predictions,
       truth = autotheft,
       estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 recall  binary         0.650
sampled_fit%>%
  extract_fit_parsnip() %>%
  vip(num_features = 10)

#We also tried some Time series Analysis For Time Series Analysis, we choose one type of crime, Auto theft based on 2017 to 2021 data.

#extract day and the number of theft, set interval unit to hours
crimes_theft<-crimes_clean%>%
  select(offense, year, month, day, hours)%>%
  mutate(Date=make_datetime(year, month, day, hours))%>%
  select(offense,Date)%>%
  filter(offense=="Theft F/Auto")%>%
  group_by(Date)%>%
  dplyr::summarise(Theft=n())%>%
  ungroup()
#check the begining and ending of the time frame
head(crimes_theft,10) 
## # A tibble: 10 × 2
##    Date                Theft
##    <dttm>              <int>
##  1 2017-01-01 08:00:00     1
##  2 2017-01-01 09:00:00     2
##  3 2017-01-01 10:00:00     3
##  4 2017-01-01 12:00:00     1
##  5 2017-01-01 13:00:00     1
##  6 2017-01-01 15:00:00     1
##  7 2017-01-01 16:00:00     2
##  8 2017-01-01 17:00:00     1
##  9 2017-01-01 18:00:00     3
## 10 2017-01-01 19:00:00     2
tail(crimes_theft, 20)
## # A tibble: 20 × 2
##    Date                Theft
##    <dttm>              <int>
##  1 2021-12-30 18:00:00     1
##  2 2021-12-30 21:00:00     1
##  3 2021-12-30 22:00:00     1
##  4 2021-12-31 00:00:00     1
##  5 2021-12-31 11:00:00     1
##  6 2021-12-31 12:00:00     1
##  7 2021-12-31 13:00:00     1
##  8 2021-12-31 14:00:00     1
##  9 2021-12-31 15:00:00     1
## 10 2021-12-31 16:00:00     3
## 11 2021-12-31 17:00:00     5
## 12 2021-12-31 18:00:00     3
## 13 2021-12-31 21:00:00     3
## 14 2021-12-31 22:00:00     1
## 15 2021-12-31 23:00:00     2
## 16 2022-01-01 00:00:00     1
## 17 2022-01-01 01:00:00     1
## 18 2022-01-01 02:00:00     1
## 19 2022-01-01 03:00:00     1
## 20 2022-01-01 04:00:00     2

#Cross Validation Time series object will be based on the Battery column and the frequency to be 24 as it is total hour of reported crime for 1 day.

#creat object
theft_ts <- ts(crimes_theft$Theft, frequency = 24)
#plot the number of theft according to date
theft_plot <-crimes_theft %>%
ggplot(aes(x = Date, y = Theft)) +
geom_line(aes(color = "Theft")) +
scale_x_datetime(name = "Date", date_breaks = "5 year") +
scale_y_continuous(breaks = seq(0, 400, 100)) + 
theme_minimal() +
labs(title = "DC Theft Crime", subtitle = "2017 - 2021")
ggplotly(theft_plot)
#decompose time series object 
#try to see the trend and seasonality 
theft_ts_dec <- theft_ts %>%
  tail(365) %>%
  decompose()
theft_ts_dec %>%
  autoplot()

The trend shows some pattern like seasonal, indicating there are other seasonality pattern that have not been caught by the plot. We will create a Multi Seasonal Time Series Object.

# Create MSTS Object
theft_multi <- msts(crimes_theft$Theft, seasonal.periods = c(24, # Daily
                                                            24*7, # Weekly
                                                            24*30)) # Monthly
# Decompose MSTS Object
theft_multi_dec <- theft_multi %>%
  mstl()

theft_multi_dec %>%
  tail(365) %>%
  autoplot()

From the plot above, we can see the trend of the theft Crime is already going smooth. The theft Crime trend itself is decreasing in the last 365 days.

#Seasonality Analysis ##Daily Seasonality These are the plot of Daily Seasonality of Battery Crime in Chicago

# Create a data frame based on MSTS Object
theft_multi_df <- as.data.frame(theft_multi_dec)
p1 <- theft_multi_df %>%
  mutate(day = crimes_theft$Date) %>%
  group_by(day) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
  head(24*2) %>%
  ggplot(aes(x = day, y = seasonal)) +
  geom_point(col = "red") + geom_line(col = "black") +
  theme_minimal()
p2 <- theft_multi_df %>%
  mutate(day = crimes_theft$Date, 
         month = month(crimes_theft$Date, label = T)) %>%
  group_by(month) %>%
  summarise(seasonal = sum(Seasonal24 + Seasonal168 + Seasonal720)) %>%
  head(24*30) %>%
  ggplot(aes(x = month, y = seasonal)) +
  geom_point() + geom_col() +
  theme_minimal()
p1/p2

As we can see from the plot above, The crime tend to increase in the month of Feb, May, July, and Nov. These can be seen as a warning signal to people around the city that the chance of them getting harmed is high around these time.